Store subword indices for quality scores in Response #357

jelmervdl · 2022-02-17T23:01:29Z

Fixes #355

This change stores the quality score per subword range instead of byte range in Response. When inserting HTML into the translation output, the byte offsets change but the subword offsets don't.

I added a type for SubwordRange (previously just an alias for ByteRange) to avoid any confusion between the two. Now you can't compare or interchange them, even though they're both just ranges.

The WASM bindings still produce a SentenceQualityScore that works exactly the same as now, i.e. with ByteRanges. In #355 I mused that Javascript could access the byteranges per word through response.target.wordAsByteRange() but neither response.target nor the type AnnotatedText are currently exposed through the bindings. So to not introduce much more new API at this point, I made getQualityScores() return the same thing it always did.

Downside of this approach is that you'll need to delete() that returned object in Javascript land. I've changed worker.js to do this.

The bindings still expose the byteranges as they previously did, because otherwise I would also have to expose AnnotatedText etc. which is apparently not something we're doing right now. Because the bindings create a new score vector on the fly, you'll have to delete it when you no longer use it, unfortunately.

jerinphilip

Leaving a few queries below. Mozilla and QE have negotiated some properties (meaningful units of QE = word, etc.). There was some threshold based operation at extension discussion as well. This was before HTML, the assumptions were plaintext QE - so we may revisit.

Requesting @mfomicheva, @abarbosa94 and @felipesantosk to take a look, since we're modifying the parts.

jerinphilip · 2022-02-18T00:03:22Z

src/tests/common-impl.cpp

-    for (const auto &wordByteRange : sentenceQualityEstimate.wordByteRanges) {
+    for (const auto &wordRange : sentenceQualityEstimate.wordRanges) {
+      const ByteRange wordByteRange{response.target.wordAsByteRange(sentenceIdx, wordRange.begin).begin,
+                                    response.target.wordAsByteRange(sentenceIdx, wordRange.end).begin};


Is there a possibility wordByteRange ends up including an HTML opening/closing tag if two subwords end up with an HTML something in between?

Yes. They are completely unsynchronized and they have to be for better or worse. I imagine HTML rendering will use something like #358 by copying the QE score tags.

jerinphilip · 2022-02-18T00:18:15Z

src/translator/quality_estimator.cpp

-  }
-
-  return words;
-}


What's happening here? There was a "words as meaningful QE units" notion before, are we losing this with the HTML integrated QE?

The QualityEstimator class still works with this assumption, subwords are grouped into words in mapWords:

bergamot-translator/src/translator/quality_estimator.cpp

Lines 241 to 268 in bce30cf

std::vector<SubwordRange> mapWords(const std::vector<float>& logProbs, const AnnotatedText& target,

const size_t sentenceIdx) {

// Ignore empty target

if ((logProbs.size() < 2) || (target.numWords(sentenceIdx) == 0)) {

return {};

}

// It is expected that translated words will have at least one word

std::vector<SubwordRange> wordIndices(/*numWords=*/1);

/// The LogisticRegressorQualityEstimator model ignores the presence of the EOS token, and hence we only need to

/// iterate n-1 positions.

for (size_t subwordIdx = 0; subwordIdx < (logProbs.size() - 1); ++subwordIdx) {

ByteRange subword = target.wordAsByteRange(sentenceIdx, subwordIdx);

const char firstLetter = target.text.at(subword.begin);

// if the first character is whitespace, it's a beginning of a new word

if (isspace(firstLetter)) {

wordIndices.back().end = subwordIdx;

wordIndices.emplace_back();

wordIndices.back().begin = subwordIdx;

}

}

wordIndices.back().end = logProbs.size() - 1;

return wordIndices;

}

The bit of code I removed here turned those groupings into a single byterange. That has been moved to the WASM bindings bit since that's the last possible moment to do it.

I think most clients will find it useful to do this themselves by accessing the AnnotatedText's data. But since we don't expose those to Javascript, and I didn't want to start a half-assed attempt at reworking the Javascript API, I moved the functionality above to here:

bergamot-translator/wasm/bindings/response_bindings.cpp

Lines 33 to 52 in bce30cf

std::vector<SentenceQualityScore> getQualityScores(const Response& response) {

std::vector<SentenceQualityScore> scores;

scores.reserve(response.qualityScores.size());

for (size_t sentenceIdx = 0; sentenceIdx < response.qualityScores.size(); ++sentenceIdx) {

std::vector<ByteRange> wordByteRanges;

wordByteRanges.reserve(response.qualityScores[sentenceIdx].wordRanges.size());

for (auto&& word : response.qualityScores[sentenceIdx].wordRanges) {

wordByteRanges.emplace_back();

wordByteRanges.back().begin = response.target.wordAsByteRange(sentenceIdx, word.begin).begin;

wordByteRanges.back().end = response.target.wordAsByteRange(sentenceIdx, word.end).begin;

}

scores.emplace_back(SentenceQualityScore{response.qualityScores[sentenceIdx].wordScores, std::move(wordByteRanges),

response.qualityScores[sentenceIdx].sentenceScore});

}

return scores;

}

Note that this does not work for HTML as the closing tags of the previous word will be placed before the spaces, so the spaces won't be skipped unfortunately.

jelmervdl · 2022-02-25T10:51:52Z

Closing this in favour of #358 (which also includes part of this pull request)

jelmervdl requested review from jerinphilip and abhi-agg February 17, 2022 23:01

Fix tests implementation

bce30cf

jelmervdl mentioned this pull request Feb 17, 2022

Embed quality-scores as HTML tag attributes #358

Merged

jerinphilip reviewed Feb 18, 2022

View reviewed changes

kpu mentioned this pull request Feb 18, 2022

QE for HTML input doesn't work #355

Closed

jelmervdl self-assigned this Feb 18, 2022

jelmervdl added 2 commits February 21, 2022 10:46

Match tight wrapping around words of original implementation

f760e0f

Note that this does not work for HTML as the closing tags of the previous word will be placed before the spaces, so the spaces won't be skipped unfortunately.

Use SubwordRange -> ByteRange conversion for both bindings and tests

d4d625d

jelmervdl closed this Feb 25, 2022

jelmervdl deleted the qe-for-html branch March 2, 2022 12:12

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Store subword indices for quality scores in Response #357

Store subword indices for quality scores in Response #357

jelmervdl commented Feb 17, 2022

jerinphilip left a comment

jerinphilip Feb 18, 2022 •

edited

Loading

kpu Feb 18, 2022

jerinphilip Feb 18, 2022

jelmervdl Feb 18, 2022

jelmervdl commented Feb 25, 2022

	std::vector<SubwordRange> mapWords(const std::vector<float>& logProbs, const AnnotatedText& target,
	const size_t sentenceIdx) {
	// Ignore empty target
	if ((logProbs.size() < 2) \|\| (target.numWords(sentenceIdx) == 0)) {
	return {};
	}
	// It is expected that translated words will have at least one word
	std::vector<SubwordRange> wordIndices(/numWords=/1);

	/// The LogisticRegressorQualityEstimator model ignores the presence of the EOS token, and hence we only need to
	/// iterate n-1 positions.
	for (size_t subwordIdx = 0; subwordIdx < (logProbs.size() - 1); ++subwordIdx) {
	ByteRange subword = target.wordAsByteRange(sentenceIdx, subwordIdx);

	const char firstLetter = target.text.at(subword.begin);

	// if the first character is whitespace, it's a beginning of a new word
	if (isspace(firstLetter)) {
	wordIndices.back().end = subwordIdx;
	wordIndices.emplace_back();
	wordIndices.back().begin = subwordIdx;
	}
	}

	wordIndices.back().end = logProbs.size() - 1;

	return wordIndices;
	}

	std::vector<SentenceQualityScore> getQualityScores(const Response& response) {
	std::vector<SentenceQualityScore> scores;
	scores.reserve(response.qualityScores.size());

	for (size_t sentenceIdx = 0; sentenceIdx < response.qualityScores.size(); ++sentenceIdx) {
	std::vector<ByteRange> wordByteRanges;
	wordByteRanges.reserve(response.qualityScores[sentenceIdx].wordRanges.size());

	for (auto&& word : response.qualityScores[sentenceIdx].wordRanges) {
	wordByteRanges.emplace_back();
	wordByteRanges.back().begin = response.target.wordAsByteRange(sentenceIdx, word.begin).begin;
	wordByteRanges.back().end = response.target.wordAsByteRange(sentenceIdx, word.end).begin;
	}

	scores.emplace_back(SentenceQualityScore{response.qualityScores[sentenceIdx].wordScores, std::move(wordByteRanges),
	response.qualityScores[sentenceIdx].sentenceScore});
	}

	return scores;
	}

Store subword indices for quality scores in Response #357

Store subword indices for quality scores in Response #357

Conversation

jelmervdl commented Feb 17, 2022

jerinphilip left a comment

Choose a reason for hiding this comment

jerinphilip Feb 18, 2022 • edited Loading

Choose a reason for hiding this comment

kpu Feb 18, 2022

Choose a reason for hiding this comment

jerinphilip Feb 18, 2022

Choose a reason for hiding this comment

jelmervdl Feb 18, 2022

Choose a reason for hiding this comment

jelmervdl commented Feb 25, 2022

jerinphilip Feb 18, 2022 •

edited

Loading